{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DDQN (Double-Deep Q-Network)\n",
    "### Function approximation Q-learning\n",
    "This tutorial walks through the implementation of deep Q networks (DQNs), \n",
    "an RL method which applies the function approximation capabilities of deep neural networks\n",
    "to problems in reinforcement learning.\n",
    "The model in this tutorial closely follows the work described in the paper \n",
    "[Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461), written by Hado van Hasselt. \n",
    "\n",
    "To keep these chapters runnable \n",
    "by as many people as possible, \n",
    "on as many machines as possible,\n",
    "and with as few headaches as possible, \n",
    "we have so far avoided dependencies on external libraries \n",
    "(besides mxnet, numpy and matplotlib). \n",
    "However, in this case, we'll need to import the [OpenAI Gym](https://gym.openai.com/docs).\n",
    "That's because in reinforcement learning, \n",
    "instead of drawing examples from a data structure, \n",
    "our data comes from interactions with an environment. \n",
    "In this chapter, our environemnts will be classic Atari video games.\n",
    "\n",
    "## Preliminaries\n",
    "The following code clones and installs the OpenAI gym.\n",
    "`git clone https://github.com/openai/gym ; cd gym ; pip install -e .[all]` \n",
    "Full documentation for the gym can be found on [at this website](https://gym.openai.com/).\n",
    "If you want to see reasonable results before the sun sets on your AI career,\n",
    "we suggest running these experiments on a server equipped with GPUs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "256"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import mxnet as mx\n",
    "from mxnet import nd, autograd\n",
    "from mxnet import gluon\n",
    "from __future__ import print_function\n",
    "import os\n",
    "import random\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from IPython import display\n",
    "import gym\n",
    "import math\n",
    "from collections import namedtuple\n",
    "import time\n",
    "import logging\n",
    "command = 'rm ./data*.log'\n",
    "os.system(command)\n",
    "\n",
    "gpu_n = 0\n",
    "\n",
    "command = 'mkdir data'\n",
    "os.system(command)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary of the algorithm\n",
    "#### Collect samples\n",
    "At the beginning of each episode (one round of the game), \n",
    "reset the environment to its initial state using `env.reset()`. \n",
    "At each time step ``t``, the environment is at `current_state`.\n",
    "With probability $\\epsilon$, apply a random action.\n",
    "Otherwise, apply $argmax_a~ Q(\\phi($ `current_state` $),a,\\theta)$,\n",
    "where $Q$ is parameterized by paramters $\\theta$ and $\\phi(\\cdot)$ is preprocessor.\n",
    "Pass the action through `env.step(action)` to receive next frame, reward and whether the game terminates.\n",
    "Append this frame to the end of the `current_state` and construct `next_state` while removeing $frame(t-12)$.\n",
    "Store the tuple $(\\phi($ `current_state` $), action, reward, \\phi($ `next_ state` $))$ in the replay buffer.\n",
    "\n",
    "#### Update Network\n",
    "* Draw batches of tuples from the replay buffer: $(\\phi,r,a,\\phi')$.\n",
    "* Define the following loss\n",
    "$$\\Large(\\small Q(\\phi,a,\\theta)-r-Q(\\phi',argmax_{a'}Q(\\phi',a',\\theta),\\theta^-)\\Large)^2$$\n",
    "* Where $\\theta^-$ is the parameter of the target network.( Set $Q(\\phi',a',\\theta^-)$ to zero if $\\phi$ is the preprocessed termination state). \n",
    "* Update the $\\theta$\n",
    "* Update the $\\theta^-$ once in a while\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Set the hyper-parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-10-13 08:37:05,519] Making new env: AssaultNoFrameskip-v4\n",
      "/home/ubuntu/src/gym/gym/envs/registration.py:17: DeprecationWarning: Parameters to load are deprecated.  Call .resolve and .require separately.\n",
      "  result = entry_point.load(False)\n",
      "[2017-10-13 08:37:05,747] batch_size: 32, num_episode: 10000000, replay_buffer_size: 1000000, replay_start_size: 50000, gamma1: 0.95, gamma2: 0.95, annealing_end: 1000000.0, Target_update: 10000, learning_frequency: 4, lr: 0.00025, frame_len: 4, no_op_max: 7.5, epsilon_min: 0.1, internal_skip_frame: 4, gamma: 0.99, max_frame: 200000000, image_size: 84, rms_eps: 0.01, ctx: gpu(0), skip_frame: 4\n"
     ]
    }
   ],
   "source": [
    "class Options:\n",
    "    def __init__(self):\n",
    "        #Architecture\n",
    "        self.batch_size = 32 # The size of the batch to learn the Q-function\n",
    "        self.image_size = 84 # Resize the raw input frame to square frame of size 80 by 80 \n",
    "        #Trickes\n",
    "        self.replay_buffer_size = 1000000 # The size of replay buffer; set it to size of your memory (.5M for 50G available memory)\n",
    "        self.learning_frequency = 4 # With Freq of 1/4 step update the Q-network\n",
    "        self.skip_frame = 4 # Skip 4-1 raw frames between steps\n",
    "        self.internal_skip_frame = 4 # Skip 4-1 raw frames between skipped frames\n",
    "        self.frame_len = 4 # Each state is formed as a concatination 4 step frames [f(t-12),f(t-8),f(t-4),f(t)]\n",
    "        self.Target_update = 10000 # Update the target network each 10000 steps\n",
    "        self.epsilon_min = 0.1 # Minimum level of stochasticity of policy (epsilon)-greedy\n",
    "        self.annealing_end = 1000000. # The number of step it take to linearly anneal the epsilon to it min value\n",
    "        self.gamma = 0.99 # The discount factor\n",
    "        self.replay_start_size = 50000 # Start to backpropagated through the network, learning starts\n",
    "        self.no_op_max = 30 / self.skip_frame # Run uniform policy for first 30 times step of the beginning of the game\n",
    "        \n",
    "        #otimization\n",
    "        self.num_episode = 10000000 # Number episode to run the algorithm\n",
    "        self.max_frame = 200000000\n",
    "        self.lr = 0.00025 # RMSprop learning rate\n",
    "        self.gamma1 = 0.95 # RMSprop gamma1\n",
    "        self.gamma2 = 0.95 # RMSprop gamma2\n",
    "        self.rms_eps = 0.01 # RMSprop epsilon bias\n",
    "        self.ctx = mx.gpu(gpu_n) # Enables gpu if available, if not, set it to mx.cpu()\n",
    "opt = Options()\n",
    "\n",
    "env_name = 'AssaultNoFrameskip-v4' # Set the desired environment\n",
    "env = gym.make(env_name)\n",
    "num_action = env.action_space.n # Extract the number of available action from the environment setting\n",
    "\n",
    "logger = logging.getLogger()\n",
    "f_name = './data/results_DDQN_%s.log' %(env_name)\n",
    "fh = logging.handlers.RotatingFileHandler(f_name)\n",
    "fh.setLevel(logging.DEBUG)#no matter what level I set here\n",
    "formatter = logging.Formatter('%(asctime)s:%(message)s')\n",
    "fh.setFormatter(formatter)\n",
    "logger.addHandler(fh)\n",
    "\n",
    "manualSeed = 1 # random.randint(1, 10000) # Set the desired seed to reproduce the results\n",
    "mx.random.seed(manualSeed)\n",
    "attrs = vars(opt)\n",
    "ff =(', '.join(\"%s: %s\" % item for item in attrs.items()))\n",
    "logging.error(str(ff))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define the DQN model\n",
    "The network is constructed as three CNN layers and a fully connected added on the top. Furthermore, the optimizer is assigned to the parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "DQN = gluon.nn.Sequential()\n",
    "with DQN.name_scope():\n",
    "    #first layer\n",
    "    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8,strides = 4,padding = 0))\n",
    "    DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))\n",
    "    DQN.add(gluon.nn.Activation('relu'))\n",
    "    #second layer\n",
    "    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4,strides = 2))\n",
    "    DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))\n",
    "    DQN.add(gluon.nn.Activation('relu'))\n",
    "    #tird layer\n",
    "    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3,strides = 1))\n",
    "    DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))\n",
    "    DQN.add(gluon.nn.Activation('relu'))\n",
    "    DQN.add(gluon.nn.Flatten())\n",
    "    #fourth layer\n",
    "    DQN.add(gluon.nn.Dense(512,activation ='relu'))\n",
    "    #fifth layer\n",
    "    DQN.add(gluon.nn.Dense(num_action,activation ='relu'))\n",
    "\n",
    "dqn = DQN\n",
    "dqn.collect_params().initialize(mx.init.Normal(0.02), ctx=opt.ctx)\n",
    "DQN_trainer = gluon.Trainer(dqn.collect_params(),'RMSProp', \\\n",
    "                          {'learning_rate': opt.lr ,'gamma1':opt.gamma1,'gamma2': opt.gamma2,'epsilon': opt.rms_eps,'centered' : True})\n",
    "dqn.collect_params().zero_grad()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Target_DQN = gluon.nn.Sequential()\n",
    "with Target_DQN.name_scope():\n",
    "    #first layer\n",
    "    Target_DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8,strides = 4,padding = 0))\n",
    "    Target_DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))\n",
    "    Target_DQN.add(gluon.nn.Activation('relu'))\n",
    "    #second layer\n",
    "    Target_DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4,strides = 2))\n",
    "    Target_DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))\n",
    "    Target_DQN.add(gluon.nn.Activation('relu'))\n",
    "    #tird layer\n",
    "    Target_DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3,strides = 1))\n",
    "    Target_DQN.add(gluon.nn.BatchNorm(axis = 1, momentum = 0.1,center=True))\n",
    "    Target_DQN.add(gluon.nn.Activation('relu'))\n",
    "    Target_DQN.add(gluon.nn.Flatten())\n",
    "    #fourth layer\n",
    "    Target_DQN.add(gluon.nn.Dense(512,activation ='relu'))\n",
    "    #fifth layer\n",
    "    Target_DQN.add(gluon.nn.Dense(num_action,activation ='relu'))\n",
    "target_dqn = Target_DQN\n",
    "target_dqn.collect_params().initialize(mx.init.Normal(0.02), ctx=opt.ctx)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Replay buffer\n",
    "Replay buffer store the tuple of : `state`, action , `next_state`, reward , done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Transition = namedtuple('Transition',('state', 'action', 'next_state', 'reward','done'))\n",
    "class Replay_Buffer():\n",
    "    def __init__(self, replay_buffer_size):\n",
    "        self.replay_buffer_size = replay_buffer_size\n",
    "        self.memory = []\n",
    "        self.position = 0\n",
    "    def push(self, *args):\n",
    "        if len(self.memory) < self.replay_buffer_size:\n",
    "            self.memory.append(None)\n",
    "        self.memory[self.position] = Transition(*args)\n",
    "        self.position = (self.position + 1) % self.replay_buffer_size\n",
    "    def sample(self, batch_size):\n",
    "        return random.sample(self.memory, batch_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocess frames\n",
    "* Take a frame, average over the `RGB` filter and append it to the `state` to construct `next_state`\n",
    "* Clip the reward\n",
    "* Render the frames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def preprocess(raw_frame, currentState = None, initial_state = False):\n",
    "    raw_frame = nd.array(raw_frame,mx.cpu())\n",
    "    raw_frame = nd.reshape(nd.mean(raw_frame, axis = 2),shape = (raw_frame.shape[0],raw_frame.shape[1],1))\n",
    "    raw_frame = mx.image.imresize(raw_frame,  opt.image_size, opt.image_size)\n",
    "    raw_frame = nd.transpose(raw_frame, (2,0,1))\n",
    "    raw_frame = raw_frame.astype(np.float32)/255.\n",
    "    if initial_state == True:\n",
    "        state = raw_frame\n",
    "        for _ in range(opt.frame_len-1):\n",
    "            state = nd.concat(state , raw_frame, dim = 0)\n",
    "    else:\n",
    "        state = mx.nd.concat(currentState[1:,:,:], raw_frame, dim = 0)\n",
    "    return state\n",
    "\n",
    "def rew_clipper(rew):\n",
    "    if rew>0.:\n",
    "        return 1.\n",
    "    elif rew<0.:\n",
    "        return -1.\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "def renderimage(next_frame):\n",
    "    if render_image:\n",
    "        plt.imshow(next_frame);\n",
    "        plt.show()\n",
    "        display.clear_output(wait=True)\n",
    "        time.sleep(.1)\n",
    "        \n",
    "l2loss = gluon.loss.L2Loss(batch_axis=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize arrays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "frame_counter = 0. # Counts the number of steps so far\n",
    "annealing_count = 0. # Counts the number of annealing steps\n",
    "epis_count = 0. # Counts the number episodes so far\n",
    "replay_memory = Replay_Buffer(opt.replay_buffer_size) # Initialize the replay buffer\n",
    "tot_clipped_reward = []\n",
    "tot_reward = []\n",
    "frame_count_record = []\n",
    "moving_average_clipped = 0.\n",
    "moving_average = 0."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-10-13 08:37:16,439] epis[0],eps[1.000000],durat[199],fnum=198, cum_cl_rew = 9, cum_rew = 189,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:37:24,930] epis[10],eps[1.000000],durat[218],fnum=2356, cum_cl_rew = 8, cum_rew = 168,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:37:34,955] epis[20],eps[1.000000],durat[329],fnum=4897, cum_cl_rew = 15, cum_rew = 315,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:37:46,161] epis[30],eps[1.000000],durat[356],fnum=7752, cum_cl_rew = 8, cum_rew = 168,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:37:57,165] epis[40],eps[1.000000],durat[331],fnum=10557, cum_cl_rew = 11, cum_rew = 231,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:38:07,843] epis[50],eps[1.000000],durat[316],fnum=13288, cum_cl_rew = 9, cum_rew = 189,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:38:19,090] epis[60],eps[1.000000],durat[279],fnum=16184, cum_cl_rew = 8, cum_rew = 168,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:38:29,696] epis[70],eps[1.000000],durat[337],fnum=18878, cum_cl_rew = 9, cum_rew = 189,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:38:39,332] epis[80],eps[1.000000],durat[140],fnum=21302, cum_cl_rew = 2, cum_rew = 42,tot_cl = 0 , tot = 0\n",
      "[2017-10-13 08:38:49,577] epis[90],eps[1.000000],durat[304],fnum=23894, cum_cl_rew = 7, cum_rew = 147,tot_cl = 0 , tot = 0\n"
     ]
    }
   ],
   "source": [
    "render_image = False # Whether to render Frames and show the game\n",
    "batch_state = nd.empty((opt.batch_size,opt.frame_len,opt.image_size,opt.image_size), opt.ctx)\n",
    "batch_state_next = nd.empty((opt.batch_size,opt.frame_len,opt.image_size,opt.image_size), opt.ctx)\n",
    "while frame_counter < opt.max_frame:\n",
    "    cum_clipped_reward = 0\n",
    "    cum_reward = 0\n",
    "    next_frame = env.reset()\n",
    "    state = preprocess(next_frame, initial_state = True)\n",
    "    t = 0.\n",
    "    done = False\n",
    "    \n",
    "\n",
    "    while not done:\n",
    "        previous_state = state\n",
    "        # show the frame\n",
    "        renderimage(next_frame)\n",
    "        sample = random.random()\n",
    "        if frame_counter > opt.replay_start_size:\n",
    "            annealing_count += 1\n",
    "        if frame_counter == opt.replay_start_size:\n",
    "            logging.error('annealing and laerning are started ')\n",
    "            \n",
    "            \n",
    "        \n",
    "        eps = np.maximum(1.-annealing_count/opt.annealing_end,opt.epsilon_min)\n",
    "        effective_eps = eps\n",
    "        if t < opt.no_op_max:\n",
    "            effective_eps = 1.\n",
    "        \n",
    "        # epsilon greedy policy\n",
    "        if sample < effective_eps:\n",
    "            action = random.randint(0, num_action - 1)\n",
    "        else:\n",
    "            data = nd.array(state.reshape([1,opt.frame_len,opt.image_size,opt.image_size]),opt.ctx)\n",
    "            action = int(nd.argmax(dqn(data),axis=1).as_in_context(mx.cpu()).asscalar())\n",
    "        \n",
    "        # Skip frame\n",
    "        rew = 0\n",
    "        for skip in range(opt.skip_frame-1):\n",
    "            next_frame, reward, done,_ = env.step(action)\n",
    "            renderimage(next_frame)\n",
    "            cum_clipped_reward += rew_clipper(reward)\n",
    "            rew += reward\n",
    "            for internal_skip in range(opt.internal_skip_frame-1):\n",
    "                _ , reward, done,_ = env.step(action)\n",
    "                cum_clipped_reward += rew_clipper(reward)\n",
    "                rew += reward\n",
    "                \n",
    "        next_frame_new, reward, done, _ = env.step(action)\n",
    "        renderimage(next_frame)\n",
    "        cum_clipped_reward += rew_clipper(reward)\n",
    "        rew += reward\n",
    "        cum_reward += rew\n",
    "        \n",
    "        # Reward clipping\n",
    "        reward = rew_clipper(rew)\n",
    "        next_frame = np.maximum(next_frame_new,next_frame)\n",
    "        state = preprocess(next_frame, state)\n",
    "        replay_memory.push((previous_state*255.).astype('uint8'),action,(state*255.).astype('uint8'),reward,done)\n",
    "        # Train\n",
    "        if frame_counter > opt.replay_start_size:        \n",
    "            if frame_counter % opt.learning_frequency == 0:\n",
    "                transitions = replay_memory.sample(opt.batch_size)\n",
    "                batch = Transition(*zip(*transitions))\n",
    "                for j in range(opt.batch_size):\n",
    "                    batch_state[j] = nd.array(batch.state[j],opt.ctx).astype('float32')/255.\n",
    "                    batch_state_next[j] = nd.array(batch.next_state[j],opt.ctx).astype('float32')/255.\n",
    "                batch_reward = nd.array(batch.reward,opt.ctx)\n",
    "                batch_action = nd.array(batch.action,opt.ctx).astype('uint8')\n",
    "                batch_done = nd.array(batch.done,opt.ctx)\n",
    "                with autograd.record():\n",
    "                    argmax_Q = nd.argmax(dqn(batch_state_next),axis = 1).astype('uint8')\n",
    "                    Q_sp = nd.pick(target_dqn(batch_state_next),argmax_Q,1)\n",
    "                    Q_sp = Q_sp*(nd.ones(opt.batch_size,ctx = opt.ctx)-batch_done)\n",
    "                    Q_s_array = dqn(batch_state)\n",
    "                    Q_s = nd.pick(Q_s_array,batch_action,1)\n",
    "                    loss = nd.mean(l2loss(Q_s ,  (batch_reward + opt.gamma *Q_sp)))\n",
    "                loss.backward()\n",
    "                DQN_trainer.step(opt.batch_size)\n",
    "                \n",
    "        \n",
    "\n",
    "        \n",
    "        t += 1\n",
    "        frame_counter += 1\n",
    "        \n",
    "        # Save the model and update Target model\n",
    "        if frame_counter > opt.replay_start_size:\n",
    "            if frame_counter % opt.Target_update == 0 :\n",
    "                check_point = frame_counter / (opt.Target_update *100)\n",
    "                fdqn = './data/target_%s_%d' % (env_name,int(check_point))\n",
    "                dqn.save_parameters(fdqn)\n",
    "                target_dqn.load_parameters(fdqn, opt.ctx)\n",
    "                fnam = './data/clippted_rew_DDQN_%s' %(env_name)\n",
    "                np.save(fnam,tot_clipped_reward)\n",
    "                fnam = './data/tot_rew_DDQN_%s' %(env_name)\n",
    "                np.save(fnam,tot_reward)\n",
    "                fnam = './data/frame_count_DDQN_%s' %(env_name)\n",
    "                np.save(fnam,frame_count_record)\n",
    "\n",
    "        if done:\n",
    "            if epis_count % 10. == 0. :\n",
    "                logging.error('epis[%d],eps[%f],durat[%d],fnum=%d, cum_cl_rew = %d, cum_rew = %d,tot_cl = %d , tot = %d'\\\n",
    "                  %(epis_count,eps,t+1,frame_counter,cum_clipped_reward,cum_reward,moving_average_clipped,moving_average))\n",
    "    epis_count += 1\n",
    "    tot_clipped_reward = np.append(tot_clipped_reward, cum_clipped_reward)\n",
    "    tot_reward = np.append(tot_reward, cum_reward)\n",
    "    frame_count_record = np.append(frame_count_record,frame_counter)\n",
    "    if epis_count > 100.:\n",
    "        moving_average_clipped = np.mean(tot_clipped_reward[int(epis_count)-1-100:int(epis_count)-1])\n",
    "        moving_average = np.mean(tot_reward[int(epis_count)-1-100:int(epis_count)-1])\n",
    "from tempfile import TemporaryFile\n",
    "outfile = TemporaryFile()\n",
    "outfile_clip = TemporaryFile()\n",
    "np.save(outfile, moving_average)\n",
    "np.save(outfile_clip, moving_average_clipped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot the overall performace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "num_epis_count = epis_count-2500\n",
    "bandwidth = 1000 # Moving average bandwidth\n",
    "total_clipped = np.zeros(int(num_epis_count)-bandwidth)\n",
    "total_rew = np.zeros(int(num_epis_count)-bandwidth)\n",
    "for i in range(int(num_epis_count)-bandwidth):\n",
    "    total_clipped[i] = np.sum(tot_clipped_reward[i:i+bandwidth])/bandwidth\n",
    "    total_rew[i] = np.sum(tot_reward[i:i+bandwidth])/bandwidth\n",
    "t = np.arange(int(num_epis_count)-bandwidth)\n",
    "fig = plt.figure()\n",
    "belplt = plt.plot(t,total_rew[0:int(num_epis_count)-bandwidth],\"r\", label = \"Return\")\n",
    "plt.legend()#handles[likplt,belplt])\n",
    "print('Running after %d number of episodes' %epis_count)\n",
    "plt.xlabel(\"Number of episode\")\n",
    "plt.ylabel(\"Average Reward per episode\")\n",
    "plt.show()\n",
    "fig.savefig('Assualt_DDQN.png')\n",
    "fig = plt.figure()\n",
    "likplt = plt.plot(t,total_clipped[0:opt.num_episode-bandwidth],\"b\", label = \"Clipped Return\")\n",
    "plt.legend()#handles[likplt,belplt])\n",
    "plt.xlabel(\"Number of episode\")\n",
    "plt.ylabel(\"Average clipped Reward per episode\")\n",
    "plt.show()\n",
    "fig.savefig('Assualt_DDQN_Clipped.png')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accumulated average reward after 8000 episodes of game Assault\n",
    "|![](../img/Assualt_DDQN.png)|![](../img/Assualt_DDQN_Clipped.png)|\n",
    "|:---------------:|:---------------:|\n",
    "|Average reward|Average clipped reward|"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
